Gene gain facilitated endosymbiotic evolution of Chlamydiae

Chlamydiae is a bacterial phylum composed of obligate animal and protist endosymbionts. However, other members of the Planctomycetes–Verrucomicrobia–Chlamydiae superphylum are primarily free living. How Chlamydiae transitioned to an endosymbiotic lifestyle is still largely unresolved. Here we reconstructed Planctomycetes–Verrucomicrobia–Chlamydiae species relationships and modelled superphylum genome evolution. Gene content reconstruction from 11,996 gene families suggests a motile and facultatively anaerobic last common Chlamydiae ancestor that had already gained characteristic endosymbiont genes. Counter to expectations for genome streamlining in strict endosymbionts, we detected substantial gene gain within Chlamydiae. We found that divergence in energy metabolism and aerobiosis observed in extant lineages emerged later during chlamydial evolution. In particular, metabolic and aerobic genes characteristic of the more metabolically versatile protist-infecting chlamydiae were gained, such as respiratory chain complexes. Our results show that metabolic complexity can increase during endosymbiont evolution, adding an additional perspective for understanding symbiont evolutionary trajectories across the tree of life.


Supplementary Discussion 1 -Inconsistencies in Chlamydiae phylogenetic relationships
The Parilichlamydiaceae are fish pathogens with reduced genomes that resemble the animal pathogen Chlamydiaceae 1 , and yet they have been placed sister to all other Chlamydiae in species reconstructions [1][2][3][4] . However, in our analyses of concatenated marker genes the phylogenetic position of Parilichlamydiaceae was unstable, indicating possible long-branch attraction (LBA) artifacts 5 (Figures S3-S4). Counter to prior suggestions of convergent evolution, Parilichlamydiaceae and Chlamydiaceae formed a monophyletic group in a Bayesian phylogeny of PVC 16S rRNA genes ( Figure S6). Despite their interesting biology and likely importance in Chlamydiae evolution, we were unable to confidently resolve the position of Parilichlamydiaceae and thus chose to remove them. Likewise, the phylogenetic placement of the recently described orphan lineage Chlamydiae bacterium 1070360-7 4 was inconsistent and removed from further analyses ( Figure S3).
Several long-branching chlamydial lineages formed a clade sister to other chlamydiae in initial species trees, but were found to be well-supported together with Simkaniaceae in Bayesian inferences with the CAT model of evolution (Figures 1 and S3-S5), which is known to alleviate LBA artifacts caused by fast-evolving sequences 6,7 . In maximum-likelihood (ML) phylogenies the position of these Simkaniaceae-like lineages was also reconstructed as forming a monophyletic group with Simkaniaceae, but only with the removal of compositionally heterogeneous sites. The removal of such sites from sequence alignments reduces systematic error in phylogenomic analyses by alleviating the artifact of species grouping together based on shared biases in amino acid composition 8 . Based on these results, we thus classified this Simkaniaceae-like group as putatively belonging to the Simkaniaceae family and included it as such for our ancestral reconstruction (Figures 1 and S3-S5). A similar pattern was seen for Chlamydiae bacterium 3300009703-49 which initially affiliated with Anoxychlamydiaceae (formerly Anoxychlamydiales 9 ), but was well-supported together with Chlamydiae Clade III (CC-III) once accounting for compositional heterogeneity (Figures 1 and S3-S5). CC-III and Anoxychlamydiaceae form a well-supported clade we refer to here as the order Anoxychlamydiales. Likewise, the position of Chlamydiae bacterium K940_chlam_8, another long-branching lineage, is supported with methods accounting for LBA and compositional bias artifacts (Figures 1 and S3-S5).
In the final dataset of 180 PVC bacteria genomes, all deep evolutionary relationships were consistently resolved in both Bayesian and ML inferences when compositionally heterogeneous sites were removed. However, in the Bayesian analysis the topology within Chlamydiaceae differed (chains 1 and 3 converged, and chains 2 and 4 converged). For the ancestral reconstruction we used the consensus Bayesian phylogeny (chains 1 and 3) with stronger convergence and where the topology in Chlamydiaceae was consistent with ML inferences.

Supplementary Discussion 2 -Taxonomic descriptions and adjustments
Previous suggestions to divide the phylum Chlamydiae into several orders have not been accepted by the Subcommittee on the Taxonomy of Chlamydiae of the International Committee on Systematics of Prokaryotes (ICSP) due to the lack of data 10,11 , but high quality data for many diverse chlamydiae have become available within the past years 2-4 . By including this additional data and a large number of non-chlamydial PVC genomes (n=89), a stable chlamydial species phylogeny could be inferred with Bayesian and ML tree inference methods (see Methods, Figures 1, S1 and S3-S6, Extended Data Figure 1, Data S1-S6). However, this was only possible after removing two family-level lineages with unstable branching patterns (Candidatus Parilichlamydiaceae and Candidatus MCF-D; see Supplementary Discussion 1) from the final dataset. Based on the single-copy marker gene species tree inferred in this study and lately published chlamydial phylogenies for marker genes and the 16S rRNA gene, we propose the reclassification of the order levels within the phylum Chlamydiae 3,4,12 .

Reclassification of the order Chlamydiales ord. corrig. (Kuo, Horn and Stephens, 2011)
The order Chlamydiales is so far the only officially accepted one within the Chlamydiae and includes all previously described family-level lineages 11,13 . Based on the stable monophyletic branching in concatenated marker gene and 16S rRNA gene tree inferences, we propose to reclassify the Chlamydiales and to only include members of the Chlamydiaceae 14 , Candidatus Clavichlamydiaceae 15 , and Candidatus Sororchlamydiaceae (formerly Chlamydiae Clade IV or CC-IV) 3,16 in this order. The family Anoxychlamydiaceae represents a distinct monophyletic lineage as supported by concatenated marker gene and 16S rRNA gene trees. Members of this family so far are only represented by metagenome-assembled genomes 3,4 . Members of this family encode the arginine deiminase pathway, [FeFe]-hydrogenase, and a pyruvate:ferredoxin oxidoreductase indicating an obligate anoxic lifestyle for these organisms. In addition many oxygen-dependent genes are missing in the genomes of the Anoxychlamydiaceae. Figure S1. Genus-level dereplication of non-Chlamydiae PVC genomes based on genome quality score. The phylogenetic tree, for representation purposes, is based on 120 bacterial single-copy marker genes 21 inferred with FastTree v2.1 22 . Clades are coloured by GTDB 21 assigned order rank. Black leafs in the species tree represent the selected genomes for downstream analysis, while white leafs were discarded because of higher quality genomes in the same genus. The inner bar chart depicts the genome quality score of genomes in genera with more than one member. The middle chart represents % GC deviation from 50, the outer chart depicts the genome size. See also Data S1-S2. Figure S2. a. Boxplot of putative marker genes for inferring species trees and the length of the corresponding trimmed protein alignment. Marker genes are split into larger COG categories with symbols indicating whether it was removed during manual tree refinement (orange triangles), removed during discordance filtering (pink crosses), or selected for use in concatenated species phylogenies (green circles). Center lines in the box-and-whisker plot represent median values, box limits represent upper and lower quartile values, and whiskers represent 1.5 times the interquartile range above the upper quartile and below the lower quartile. Number of genes included in the box-andwhisker plots from top to bottom n=(4, 15, 12, 85). b. Marker gene COGs that passed manual tree refinement are ranked according to discordance score 23 . The red line indicates the fraction of marker genes (n=5) that were removed based on having the largest discordance from other trees. See also Data S3. Figure S3. Maximum-likelihood (a,b) and Bayesian consensus (c,d) species phylogenies of concatenated single-copy marker genes from all PVC bacteria representatives (n=184 taxa) were inferred with the complete alignment (a,c) and with compositionally heterogeneous sites removed (b,d). High support for bipartitions is indicated by a black circle, with support measured by non-parametric bootstraps (BP) and the transfer bootstrap expectation (TBE) inferred using IQ-TREE with the LG+C60+F+Γ4-derived PMSF approximation, or posterior probability (P) inferred with Phylobayes using the CAT+GTR+Γ4 model of evolution. Consistent clades are collapsed with taxonomy indicated and chlamydial families coloured. Taxa with unclear phylogenetic affiliation are coloured red. Run characteristics and converged chains are indicated for Bayesian phylogenies in a grey box (c,d). Scale bars indicate the number of substitutions per site. e. The 1% most compositionally heterogeneous sites were removed incrementally starting from the initial alignment. Boxplots show the log range of χ2 test scores (scale on the left) across taxa from alignments with the corresponding percentage of sites removed. The red line (scale to the right) indicates the number of taxa with significant χ2 test scores, and hence deviation from patterns of amino acid composition found in other taxa. Center lines in the box-and-whisker plot represent median values, box limits represent upper and lower quartile values, and whiskers represent 1.5 times the interquartile range above the upper quartile and below the lower quartile. Arrows indicate the initial alignment (a,c) and the percentage of alignment sites removed for no remaining taxa with significant deviations in composition (b,d). Changes in BP support for the monophyly of different groups of interest is shown in the bottom line plots, at different intervals of percentage sites removed. Abbreviations and short forms include: Chlam b. (Chlamydiae bacterium), G1 (Group 1), G2 (Group 2), and CC-III (Chlamydiae Clade III). See also Data S4-S6. Figure S4. Maximum-likelihood (a,b) and Bayesian consensus (c,d) species phylogenies of concatenated single-copy marker genes from PVC representatives with the removal of Chlamydiae bacterium 1070360-7 (n=183 taxa) were inferred with the complete alignment (a,c) and with compositionally heterogeneous sites removed (b,d). High support for bipartitions is indicated by a black circle, with support measured by non-parametric bootstraps (BP) and the transfer bootstrap expectation (TBE) inferred using IQ-TREE with the LG+C60+F+Γ4-derived PMSF approximation, or posterior probability (P) inferred with Phylobayes using the CAT+GTR+Γ4 model of evolution. Consistent clades are collapsed with their taxonomy indicated and chlamydial families coloured. Taxa with unclear phylogenetic affiliation are coloured red. Run characteristics and converged chains are indicated for Bayesian phylogenies in a grey box (c,d).

SUPPLEMENTARY FIGURES
Scale bars indicate the number of substitutions per site. e. The 1% most compositionally heterogeneous sites were removed incrementally starting from the initial alignment. Boxplots show the log range of χ2 test scores (scale on the left) across taxa from alignments with the corresponding percentage of sites removed. The red line (scale to the right) indicates the number of taxa with significant χ2 test scores, and hence deviation from patterns of amino acid composition found in other taxa. Center lines in the box-and-whisker plot represent median values, box limits represent upper and lower quartile values, and whiskers represent 1.5 times the interquartile range above the upper quartile and below the lower quartile. Arrows indicate the initial alignment (a,c) and the percentage of alignment sites removed for no remaining taxa with significant deviations in composition (b,d). Changes in BP support for the monophyly of different groups of interest is shown in the bottom line plots, at different intervals of percentage sites removed. Abbreviations and short forms include: Chlam b. (Chlamydiae bacterium), G1 (Group 1), G2 (Group 2), and CC-III (Chlamydiae Clade III). See also Data S4-S6. Figure S5. Maximum-likelihood (a,b) and Bayesian consensus (c,d) species phylogenies of concatenated single-copy marker genes from PVC representatives with the removal of Chlamydiae bacterium 1070360-7 and members of the Parilichlamydiaceae family (n=180 taxa) were inferred with the complete alignment (a,c) and with compositionally heterogeneous sites removed (b,d). High support for bipartitions is indicated by a black circle, with support measured by non-parametric bootstraps (BP) and the transfer bootstrap expectation (TBE) inferred using IQ-TREE with the LG+C60+F+Γ4-derived PMSF approximation, or posterior probability (P) inferred with Phylobayes using the CAT+GTR+Γ4 model of evolution. Consistent clades are collapsed with their taxonomy indicated and chlamydial families coloured. Run characteristics and converged chains are indicated for Bayesian phylogenies in a grey box (c,d). Scale bars indicate the number of substitutions per site. e. The 1% most compositionally heterogeneous sites were removed incrementally starting from the initial alignment. Boxplots show the log range of χ2 test scores (scale on the left) across taxa from alignments with the corresponding percentage of sites removed. The red line (scale to the right) indicates the number of taxa with significant χ2 test scores, and hence deviation from patterns of amino acid composition found in other taxa. Center lines in the box-and-whisker plot represent median values, box limits represent upper and lower quartile values, and whiskers represent 1.5 times the interquartile range above the upper quartile and below the lower quartile. Arrows indicate the initial alignment (a,c) and the percentage of alignment sites removed for no remaining taxa with significant deviations in composition (b,d). Changes in BP support for the monophyly of different groups of interest is shown in the bottom line plots, at different intervals of percentage sites removed. Abbreviations and short forms include: Chlam b. (Chlamydiae bacterium), G1 (Group 1), G2 (Group 2), and CC-III (Chlamydiae Clade III). See also Figure 1 and Data S4-S6.    For comparison, we defined the ALE frequency cutoffs at P ≤ 0.3 ≤ 0.5 ≤ 0.7 as sensitive, specific, and very specific, respectively. a. Plots showing intersections of gene families (y-axis) inferred to be present in early PVC ancestors using different reconstruction methods and mapped to a schematic phylogenetic tree based on Figure 1. The X-axis depicts the total inferred gene families per method and ancestor, respectively. Comparisons (bd) of inferred gene content in PVC ancestors (n=179). b. Boxplots depict the percentage of gene families inferred using both Count and the respective ALE cutoffs relative to all inferred gene families per PVC ancestors with both methods. Purple line indicates the median value of the sensitive cutoff. c. Percentage of uniquely inferred gene families in all PVC ancestors with the respective ALE cutoff in comparison to Count. The purple line indicates the median value of the sensitive cutoff. d. Total unique gene families inferred with ALE and Count with the respective ALE cutoffs. Center lines in the box-and-whisker plot represent median values, box limits represent upper and lower quartile values, and whiskers represent 1.5 times the interquartile range above the upper quartile and below the lower quartile. See Data S7-S8 and repository files for raw ALE and Count output. Figure S10. Overview of ancestral events and inferred proteome sizes across all PVC nodes, alongside node numbers. Only gene copy and event numbers with reconciliation frequencies ≥ 0.3 were considered (see Methods). Barplots at each branch indicate origination, transfer, and duplication events in the positive direction (grey bars; see legend), and loss events in the negative direction (red bars), while circle size represents inferred ancestral proteome size (i.e., number of protein-coding gene copies) for each node to the right. The maximum bar size is 1000, several cases with larger numbers of events are capped at this size (in non-Chlamydiae PVC nodes). Taxonomic groups are indicated and chlamydial families coloured. Key ancestors are indicated and abbreviations can be found in Data S7 alongside event counts for each node.   (14,16,14,16,4,6,6,8,17,19,12,14,12). Nodes and branches are coloured by family assignment according to the legend. Abbreviations of key ancestors are labeled as in Figure 3 (see Data S7). See also Data S8 and repository.

SUPPLEMENTARY DATA
Data S1. Overview of initial prefiltered PVC bacteria dataset and genome representative selection. Includes genome quality and if genomes were selected for downstream analysis for both nonchlamydial PVC bacteria (a) and Chlamydiae (b), representative genomes after species-level dereplication (c), and subsequent selection of non-Chlamydiae PVC bacteria genus representatives (d).
Data S2. Overview of PVC genome representatives selected for use. Includes taxonomy, source information (e.g., Genbank identifier), species and strain names, genome identifiers used in species phylogenies and for ALE analyses, and genome characteristics.
Data S3. Selection of single-copy marker genes. NOG identifiers and annotations, alongside trimmed alignment length are given for the initial dataset of 116 marker genes. NOGs removed and retained from the marker gene set are indicated after each of the two rounds of tree refinement. The protein sequence identifiers are listed for sequences removed based on tree refinement due to either being duplicates or partial sequences, and those that could represent HGT events, contamination, or distant paralogs. Discordance scores for each of the 79 gene markers that passed tree refinement are given, and those subsequently removed alongside the 74 marker genes selected as the final dataset are indicated.
Data S4. Summary of the stepwise removal of the most compositionally heterogeneous sites from alignments with 184 (a), 183 (b), and 180 (c) taxa in 1% increments. The corresponding alignment length, percentage of the alignment removed, and number of taxa significantly divergent in composition is given for each step, alongside the χ2 test score for each taxon. Dark green indicates alignments corresponding to trees shown in Figures 1 and S3-S5, while light green indicates alignments where ML phylogenies were also inferred and used to assess the monophyly of different groups in Figures S3-S5.

Data S5.
Overview of all species phylogenies inferred, with the number of taxa, percentage of the total alignment pruned, alignment length, model of evolution, inference method and supports, and where the phylogeny can be found, with corresponding page numbers for Data S6. Data S6. Uncollapsed trees for all species phylogenies inferred including both ML (both PMSF non-parametric bootstrap and TBE supports) and Bayesian trees (all chains, and relevant consensus trees with posterior probabilities). Uncollapsed ML trees for subunits of proton-transporting NADH dehydrogenase (NuoA-N), cytochrome o ubiquinol oxidase (CyoA-D), and proton-driven ATP synthase (AtpA-H).
Data S7. Summary of key ancestor nodes and the corresponding ancestor abbreviation and included taxonomic groups (a) alongside the percentage of genes mapped to NOGs, the number of genes per taxon included in the reconciliation, singletons, and those excluded per genome (b). Overview of the number of events at each node inferred using raw summed events (c) and different cutoffs (0.1 to 0.9, in 0.05 increments; 0.3 selected) for duplications (d), transfers (e), losses (f), originations (g), and copies (i.e., ancestral proteome size) (h). The sum of gains and losses across COG functional categories at a reconciliation frequency of 0.3 (i) Data S8. Ancestral genome content reconstructions based on the gene-tree unaware method Count. Includes inferred copy number per gene family per node in the species phylogeny, if the inferred copy number was larger than 0. Data S9. Summary of annotations of gene content of selected PVC ancestors: all annotations (a), LPVCCA (b), LVCCA (c), LVCA (d), LCCA (e), LG1CA (f), LG2CA (g). Includes annotation information from eggNOG, PFAM, TIGRFAM, and Interpro databases per gene family in addition to noted manual curation for important genes per ancestor referred to in Figure 2 and Extended Data Figure 2.
Data S10. An overview of gene annotations presented in Figures 2-3 and Extended Data Figures  S5-S6 is also outlined for genes and complexes related to: the electron transport chain (a), the TCA cycle and fermentation (b), and other key genes and pathways of interest (c), alongside inferred copy number (hence presence) across all Chlamydiae ancestors and major PVC ancestors. Data S11. Taxonomic groups affiliated with chlamydiae in gene trees for inferred originations, and representing the putative HGT donor lineage (taxonomy of ≥ 75% of sequences in sister clade) and visualized in Extended Data Figure 4 (a). Domain, superphylum, and phylum affiliation of each chlamydial gene family origination based on different cutoffs of percentage taxa (75, 90, and 100%) in a supported monophyletic clade sister to chlamydial sequences, and in the clade sister to this group (nested) (b).